Caitlin Rowley - Blog Post 5

Author

Caitlin Rowley

I have switched data sets for future blog posts. I will be using open-ended survey responses from a 1996 study titled Survey of Gun Owners in the United States.

For the study, respondents were asked six qualifying questions related to: (1) gun ownership, (2) gun-carrying practices, (3) gun display against the respondent, (4) gun use in self-defense against animals, (5) gun use in self-defense against people, and (6) other weapons used in self-defense. A “yes” response to a qualifying question led to a series of additional questions on the same topic as the qualifying question.

The open-ended responses include descriptions specifically related to the following questions: (1) where the respondent was when he or she displayed a gun (in self-defense or otherwise), (2) specific reasons why the respondent displayed a gun, (3) how the other individual reacted when the respondent displayed the gun, (4) how the individual knew the respondent had a gun, (5) whether the police were contacted for specific self-defense events, and (6) if not, why not.

I will focus on the following research question: Can we identify the most common circumstance in which respondents displayed a gun? I am hoping that “circumstance” can include references to both the catalyst in the situation and the environment.

# install packages:

install.packages("RColorBrewer")
Installing package into 'C:/Users/caitr/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
install.packages("stopwords")
Installing package into 'C:/Users/caitr/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
# load libraries: 

library(readr)
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.2
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ dplyr   1.0.10
✔ tibble  3.1.8      ✔ stringr 1.4.1 
✔ tidyr   1.2.1      ✔ forcats 0.5.2 
✔ purrr   0.3.4      
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(quanteda)
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"unpackedMatrix" of class "mMatrix"; definition not updated
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"unpackedMatrix" of class "replValueSp"; definition not updated
Package version: 3.2.3
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 4 of 4 threads used.
See https://quanteda.io for tutorials and examples.
library(magrittr)

Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract
library(RColorBrewer)
library(wordcloud)
library(tidytext)
Warning: package 'tidytext' was built under R version 4.2.2
library(stopwords)
Warning: package 'stopwords' was built under R version 4.2.2
library(tm)
Loading required package: NLP

Attaching package: 'NLP'

The following objects are masked from 'package:quanteda':

    meta, meta<-

The following object is masked from 'package:ggplot2':

    annotate


Attaching package: 'tm'

The following object is masked from 'package:stopwords':

    stopwords

The following object is masked from 'package:quanteda':

    stopwords
library(quanteda.textplots)
library(tokenizers)

# read in data:

Survey <- read_csv("C:\\Users\\caitr\\OneDrive\\Documents\\DACSS\\DACSS 679\\DACSS 697\\DACSS 697 - Survey.csv", col_names = TRUE)
Rows: 814 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Q#, VERBATIM RESPONSE
dbl (2): ID#, CODE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# rename columns:

names(Survey) <- c('Respondent_ID','Survey_Question', 'Code', 'Open_Ended_Response')
print(Survey)
# A tibble: 814 × 4
   Respondent_ID Survey_Question  Code Open_Ended_Response                      
           <dbl> <chr>           <dbl> <chr>                                    
 1           250 12                  0 "I had it in my vehicle to go hunting, n…
 2           248 28                  0 "None of the above.  I was an employee w…
 3           250 6                   1 "Four of them."                          
 4           250 6                   1 "Four of them."                          
 5           250 7                   1 "My shotgun."                            
 6           250 7                   1 "My shotgun."                            
 7           428 7                   1 "{He said \"no\" to fully automatic weap…
 8           409 8                   1 "If they were used incorrectly."         
 9          1165 9                   1 "Well, yeah, community property."        
10           250 12                  1 "I had it in my vehicle to go hunting, n…
# … with 804 more rows
# remove duplicate observations: 

Survey_unique <- distinct(Survey)

Survey_unique %>% 
    separate_rows(Open_Ended_Response) %>% 
    distinct() %>%
    nrow
[1] 8134

After removing duplicates, there are 812 rows in this data frame and 4 columns. The four columns represent the following variables: survey respondent ID ('Respondent_ID'), survey question ('Survey_Question'), response code ('Code'), and verbatim open-ended response ('Open_Ended_Response'). Each of the 812 rows now represents a unique observation. Additionally, splitting the responses into individual words and removing duplicate rows yields 8,134 distinct respondent-question-word combinations (note that this is a count of distinct rows, not of unique word types).
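If a count of unique word types is what we actually want, a quick cross-check with tidytext is possible (a sketch; it assumes the renamed columns above, and note that unnest_tokens lowercases and strips punctuation by default):

# cross-check (sketch): count distinct word types across all responses:
Survey_unique %>%
    unnest_tokens(word, Open_Ended_Response) %>%
    distinct(word) %>%
    nrow()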

I will next create a corpus and remove both capitalization and punctuation from all open-ended responses.

# create a corpus:

Survey_corpus <- corpus(Survey_unique$Open_Ended_Response)
head(Survey_corpus)
Corpus consisting of 6 documents.
text1 :
"I had it in my vehicle to go hunting, not on my person."

text2 :
"None of the above.  I was an employee working in a convenien..."

text3 :
"Four of them."

text4 :
"My shotgun."

text5 :
"{He said "no" to fully automatic weapons but "yes" to semi-a..."

text6 :
"If they were used incorrectly."
# tokenize, remove capitalization and punctuation:

Survey_tokens <- tokens(Survey_corpus, 
    remove_punct = TRUE)
Survey_tokens <- tokens_tolower(Survey_tokens)
head(Survey_tokens)
Tokens consisting of 6 documents.
text1 :
 [1] "i"       "had"     "it"      "in"      "my"      "vehicle" "to"     
 [8] "go"      "hunting" "not"     "on"      "my"     
[ ... and 1 more ]

text2 :
 [1] "none"        "of"          "the"         "above"       "i"          
 [6] "was"         "an"          "employee"    "working"     "in"         
[11] "a"           "convenience"
[ ... and 1 more ]

text3 :
[1] "four" "of"   "them"

text4 :
[1] "my"      "shotgun"

text5 :
 [1] "he"             "said"           "no"             "to"            
 [5] "fully"          "automatic"      "weapons"        "but"           
 [9] "yes"            "to"             "semi-automatic" "he"            
[ ... and 28 more ]

text6 :
[1] "if"          "they"        "were"        "used"        "incorrectly"
# remove stopwords separately: 

Survey_tokens_stopwords <- tokens_select(Survey_tokens,
                                       pattern = stopwords("en"),
                                       selection = "remove")

I will next extract features from my corpus by creating a document-feature matrix (dfm). From there, I will be able to generate data visualizations.

# create document-feature matrix:

# assign to 'Survey_dfm', since all downstream code refers to that name:
Survey_dfm <- dfm(Survey_tokens_stopwords)

# identify 10 most common words:

topfeatures(Survey_dfm, 10)
# create a wordcloud:

textplot_wordcloud(Survey_dfm, min_count = 5, max_words = 50, random_order = FALSE)

Next, I will generate a feature co-occurrence matrix:

# create a dfm limited to frequent words: terms that appear at least 30 times overall and in at least 30% of responses.

Survey_dfm_freq <- dfm_trim(Survey_dfm, min_termfreq = 30)
Survey_dfm_freq <- dfm_trim(Survey_dfm_freq, min_docfreq = 0.3, docfreq_type = "prop")
# use dfm to create a feature co-occurring matrix:

Survey_fcm <- fcm(Survey_dfm_freq)
# check dimensions:

dim(Survey_fcm)

QUESTION: The first time I ran this chunk of code, the dimensions of the feature co-occurrence matrix were shown to be 395 rows by 395 columns. After reopening R and re-running the code, the dimensions showed as 0 rows by 0 columns. The likely culprit is the assignment fixed above: the dfm had originally been saved to 'Survey_tokens' rather than 'Survey_dfm', so 'Survey_dfm' existed only in the stale session. The strict trimming thresholds may also play a role, since requiring a term to appear in at least 30% of 812 short responses can leave very few surviving features.
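For reference, co-occurrence can also be computed directly from the tokens with a sliding window rather than from the trimmed dfm, which sidesteps the trimming thresholds entirely. A minimal sketch (the 5-token window and the top-30 cutoff are my own choices, not part of the original analysis):

# sketch: window-based co-occurrence straight from the tokens:
Survey_fcm_window <- fcm(Survey_tokens_stopwords,
                         context = "window", window = 5)

# plot a network of the 30 most frequent co-occurring features:
top_feats <- names(topfeatures(Survey_fcm_window, 30))
textplot_network(fcm_select(Survey_fcm_window, pattern = top_feats))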

______________________

Next, I will apply the correlated topic modeling approach to my data set, since I do not have a covariate such as sentiment.

# load packages:

library(stm)
Warning: package 'stm' was built under R version 4.2.2
stm v1.3.6 successfully loaded. See ?stm for help. 
 Papers, resources, and other materials at structuraltopicmodel.com
library(quanteda)

# generate a cleaned document-feature matrix:

Survey_dfm_clean <- Survey_corpus %>%
    tokens(remove_punct = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(stopwords("en")) %>%
    dfm()
dim(Survey_dfm_clean)

The dimensions of my document-feature matrix are 812 rows and 1224 columns. Next, I will generate the correlated topic model.

# generate correlated topic model:

cor_topic_model <- stm(Survey_dfm, K = 5,
                       verbose = FALSE, init.type = "Spectral")
cor_topic_model
summary(cor_topic_model)
# label topics:
# frex = words that are both frequent in and exclusive to a topic.
# lift = the topic-word probability divided by the word's overall empirical frequency; high-lift words are distinctive for a topic.

labelTopics(cor_topic_model)

I now have a topic model with 5 topics, 805 documents, and a 1224-word dictionary. (stm drops responses that end up empty after punctuation and stopword removal, which appears to be why 805 of the 812 documents remain.)

# identify document most frequently associated with the five topics:

# note: the dfm has no Open_Ended_Responses column; the texts must come
# from the original data, aligned with the 805 documents the model kept
# (see the sketch below):
findThoughts(cor_topic_model,
             texts = Survey_unique$Open_Ended_Response,
             topics = c(1:5),
             n = 1)

I am having a little trouble with this code; my understanding is that it is intended to yield the document most strongly associated with each of the five topics, but I am seeing a different, less informative output. Two issues stand out: the texts argument has to come from the original response column rather than from the dfm, and its length has to match the 805 documents the model actually kept. I will work more on this.
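One way to resolve both issues is to let quanteda hand stm a pre-aligned set of documents, vocabulary, and metadata via convert(). A sketch, with my own additions flagged: the duplicated Response_Text column (so the raw text survives as a docvar) and the rebuilt corpus are not part of the original pipeline.

# sketch: keep texts aligned with the documents stm actually keeps:
Survey_unique$Response_Text <- Survey_unique$Open_Ended_Response  # duplicate text as a docvar
Survey_corpus2 <- corpus(Survey_unique, text_field = "Open_Ended_Response")
Survey_dfm2 <- tokens(Survey_corpus2, remove_punct = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(stopwords("en")) %>%
    dfm()

# convert() drops empty documents and keeps the metadata aligned:
out <- convert(Survey_dfm2, to = "stm")
ctm2 <- stm(out$documents, out$vocab, K = 5,
            init.type = "Spectral", verbose = FALSE)
findThoughts(ctm2, texts = out$meta$Response_Text, topics = 1:5, n = 1)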

QUESTION: Do I need to identify the question numbers associated with location/reason for drawing a weapon and apply this as a covariate? Do I need a covariate at all? If so, how do I identify one when all open-ended responses are aggregated? That is, the 'Survey_Question' variable is the only indicator of response categories, so there is not necessarily another variable to choose from as a predictor. One possibility is sketched below.
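If the survey question is worth trying as a covariate, stm can take it as a prevalence predictor. A sketch (my own suggestion, reusing the 'out' object from the previous snippet; it treats Survey_Question as a categorical covariate):

# sketch: survey question as a prevalence covariate:
out$meta$Survey_Question <- as.factor(out$meta$Survey_Question)
Survey_STM_cov <- stm(out$documents, out$vocab, K = 5,
                      prevalence = ~ Survey_Question,
                      data = out$meta,
                      init.type = "Spectral", verbose = FALSE)

# estimate how topic prevalence varies across question types:
prevalence_effects <- estimateEffect(1:5 ~ Survey_Question, Survey_STM_cov,
                                     metadata = out$meta)
summary(prevalence_effects)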

______________

Next, I will try structural topic modeling (without a predictor). My thought is that these results may look similar to those of the correlated topic model, since I am not adding a predictor.

# choose the number of topics:

k <- 5
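Before committing to five topics, stm's searchK() can compare candidate values of K on held-out likelihood, residuals, and semantic coherence. A sketch (reusing the aligned 'out' object from the earlier snippet; the candidate set is arbitrary):

# sketch: compare candidate numbers of topics:
Survey_searchK <- searchK(out$documents, out$vocab,
                          K = c(3, 5, 7, 10),
                          init.type = "Spectral")
plot(Survey_searchK)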

# specify model:

# with no covariates, the data argument is omitted:
Survey_STM <- stm(Survey_dfm,
               K = k,
               max.em.its = 1224,
               seed = 1234,
               init.type = "Spectral")
labelTopics(Survey_STM)

Try some visualization to capture the estimated frequency of words across topics:

plot(Survey_STM, type = "summary")

Next, I want to try extracting the topics and building a label for each one. I will extract the top words for each topic (as identified by 'frex'), collapse each topic's words into a single string, and use those strings as labels:

# get the words:

Survey_words <- labelTopics(Survey_STM, n = 5)$frex
# set up an empty vector:

Survey_topic_labels <- rep(NA, k)

# set up a loop to go through the topics and collapse the words to a single label:

for(i in 1:k){
    Survey_topic_labels[i] <- paste(Survey_words[i, ], collapse = "_")
}
# print the labels:

Survey_topic_labels

Without a predictor or covariate, I am unable to measure how document-level variables affect the topic distributions. Identifying a way to incorporate a covariate or predictor (for example, the survey question, as sketched above) will be my next task.

I may also try a k-means cluster analysis.
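A k-means pass could start from the document-feature matrix, typically tf-idf weighted so that long responses do not dominate. A minimal sketch (my own addition; the choice of 5 centers mirrors K above but is otherwise arbitrary):

# sketch: k-means on a tf-idf-weighted dfm:
Survey_tfidf <- dfm_tfidf(Survey_dfm)
set.seed(1234)
Survey_km <- kmeans(as.matrix(Survey_tfidf), centers = 5, nstart = 20)
table(Survey_km$cluster)   # responses per cluster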